{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YUCX_4AeCFrO"
   },
   "source": [
    "# Simulation on Orthogonal Estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qlbQ0Tbcd2nB"
   },
   "source": [
    "We compare the performance of the naive and orthogonal methods in a computational experiment where\n",
    "$p=n=100$, $\\beta_j = 1/j^2$, $(\\gamma_{DW})_j = 1/j^2$ and $$Y = 1 \\cdot D + \\beta' W + \\epsilon_Y$$\n",
    "\n",
    "where $W \\sim N(0,I)$, $\\epsilon_Y \\sim N(0,1)$, and $$D = \\gamma'_{DW} W + \\tilde{D}$$ where $\\tilde{D} \\sim N(0,1)/4$.\n",
    "\n",
    "The true treatment effect here is 1. From the plots produced in this notebook (estimate minus ground truth), we show that the naive single-selection estimator is heavily biased (lack of Neyman orthogonality in its estimation strategy), while the orthogonal estimator based on partialling out, is approximately unbiased and Gaussian."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "mS89_Re5ECjm"
   },
   "outputs": [],
   "source": [
    "# As before, we import clone hdmpy to use rlasso functions in python\n",
    "!pip install multiprocess\n",
    "!pip install pyreadr\n",
    "!git clone https://github.com/maxhuppertz/hdmpy.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9b8DZ8SzCFrO"
   },
   "outputs": [],
   "source": [
    "import hdmpy\n",
    "import seaborn as sns\n",
    "import numpy as np\n",
    "import statsmodels.api as sm\n",
    "import matplotlib.pyplot as plt\n",
    "from joblib import Parallel, delayed\n",
    "import warnings\n",
    "warnings.simplefilter('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "G7t_ujgeG42T"
   },
   "outputs": [],
   "source": [
    "# Initialize constants\n",
    "B = 10000  # Number of iterations\n",
    "n = 100  # Sample size\n",
    "p = 100  # Number of features\n",
    "\n",
    "# Sim Parameters\n",
    "mean = 0\n",
    "sd = 1\n",
    "\n",
    "\n",
    "def exp(it):\n",
    "    np.random.seed(it)\n",
    "    # Generate parameters:\n",
    "    gamma = (1 / (np.arange(1, p + 1) ** 2)).reshape(p, 1)\n",
    "    beta = (1 / (np.arange(1, p + 1) ** 2)).reshape(p, 1)\n",
    "\n",
    "    # Generate covariates / random data\n",
    "    X = np.random.normal(mean, sd, n * p).reshape(n, p)\n",
    "    D = (X @ gamma) + np.random.normal(mean, sd, n).reshape(n, 1) / 4\n",
    "\n",
    "    # Generate Y using DGP\n",
    "    Y = D + (X @ beta) + np.random.normal(mean, sd, n).reshape(n, 1)\n",
    "\n",
    "    # Single selection method using rlasso\n",
    "    r_lasso_estimation = hdmpy.rlasso(np.concatenate((D, X), axis=1), Y, post=True)\n",
    "    coef_array = r_lasso_estimation.est['coefficients'].iloc[2:, :].to_numpy()\n",
    "    SX_IDs = np.where(coef_array != 0)[0]\n",
    "\n",
    "    # Check if any X coefficients are selected\n",
    "    if sum(SX_IDs) == 0:\n",
    "        # If no X coefficients are selected, regress Y on D only\n",
    "        Naive = sm.OLS(Y, sm.add_constant(D)).fit().params[1]\n",
    "    else:\n",
    "        # If X coefficients are selected, regress Y on selected X and D\n",
    "        X_D = np.concatenate((D, X[:, SX_IDs]), axis=1)\n",
    "        Naive = sm.OLS(Y, sm.add_constant(X_D)).fit().params[1]\n",
    "\n",
    "    # Double Lasso Partialling Out\n",
    "    fitY = hdmpy.rlasso(X, Y, post=True)\n",
    "    resY = fitY.est['residuals']\n",
    "\n",
    "    fitD = hdmpy.rlasso(X, D, post=True)\n",
    "    resD = fitD.est['residuals']\n",
    "\n",
    "    Orthogonal = sm.OLS(resY, sm.add_constant(resD)).fit().params[1]\n",
    "\n",
    "    return Naive, Orthogonal\n",
    "\n",
    "\n",
    "results = Parallel(n_jobs=-1, verbose=3)(delayed(exp)(it) for it in range(B))\n",
    "\n",
    "Naive, Orthogonal = zip(*results)\n",
    "Naive, Orthogonal = np.array(Naive), np.array(Orthogonal)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7gdNn4psCFrS"
   },
   "outputs": [],
   "source": [
    "# Create a figure with two subplots side by side\n",
    "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n",
    "\n",
    "# Plot a histogram for the 'Naive' estimates vector\n",
    "sns.histplot(Naive - 1, bins=20, kde=False, color='green', ax=axes[0])\n",
    "axes[1].set_title('Orthogonal', fontsize=14)\n",
    "axes[1].set_xlabel('Bias', fontsize=12)\n",
    "axes[1].set_ylabel('Frequency', fontsize=12)\n",
    "\n",
    "# Plot a histogram for the 'Orthogonal' estimates vector\n",
    "sns.histplot(Orthogonal - 1, bins=20, kde=False, color='blue', ax=axes[1])\n",
    "axes[0].set_title('Naive', fontsize=14)\n",
    "axes[0].set_xlabel('Bias', fontsize=12)\n",
    "axes[0].set_ylabel('Frequency', fontsize=12)\n",
    "\n",
    "# Add a title to the entire figure\n",
    "fig.suptitle('Distribution of Estimates (Centered around Ground Truth)', fontsize=16)\n",
    "\n",
    "# Display the histograms side by side\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "geB_BAaDq2cl"
   },
   "source": [
    "As we can see from the above bias plots (estimates minus the ground truth effect of 1), the double lasso procedure concentrates around zero whereas the naive estimator does not.\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
